EVALUATION


The objective of this program was to develop a real-time,
speaker-independent algorithm for recognizing a small vocabulary of
English words spoken in a natural and unconstrained manner. The
algorithm automatically extracts key parameters from a preamble (a
short phrase different from the vocabulary words). These parameters
initiate a speaker transformation that minimizes both inter- and
intra-speaker variations to achieve a speaker-independent recognition
system.

Overall recognition results for 20 trained and 17 untrained speakers,
for a vocabulary consisting of the connected digits plus the word
"point" spoken in strings of 1, 2, or 3 words (160 words per
speaker), were 97.3 and 86.0 percent correct recognition, respectively.

This capability will have an impact on future keyword and language
recognition programs and has practical applications in data entry,
retrieval, and command and control tasks.

RICHARD  S.  VONUSA 
Project  Engineer 

INTRODUCTION.

Programs were written to build a speaker- and channel-independent
connected speech recognizer for 21 words (the digits 0 through 9 and
"point", plus 10 command words). Speaker independence is realized on
the basis of a preamble that does not contain the vocabulary words.
The recognizer is built on the new PDP 11/70 computer.

The evaluation and data base were accumulated using new recordings
and new template-making and optimization procedures. The real-time
operation, pitch correction, saturation normalization, and speed
normalization efforts necessitated an exploratory mode in which an
increasing number of directions had to be tried. The development
time was extensive because some of these directions did not work out
satisfactorily and others resulted in a compromise in accuracy.

1. To achieve real-time operation, an attempt was made, as soon as
the new machine became operational, to replace template matching by
spectral correlation with a spectral subtraction procedure
(mean-error minimization). The spectral subtraction method is
computationally much faster than correlation since it involves no
multiplications. However, the end results are less satisfactory.
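
The trade-off in item 1 can be illustrated with a minimal sketch. It assumes that "spectral correlation" means a normalized dot product of two filter-bank spectra and that "spectral subtraction (mean-error minimization)" means picking the template with the smallest mean absolute difference; the report does not give either formula, and the function names below are hypothetical.

```python
import math


def correlation_score(sample, template):
    """Normalized correlation of two spectra; higher means more similar.

    Needs one multiplication per channel plus the norms.
    """
    dot = sum(s * t for s, t in zip(sample, template))
    norm = math.sqrt(sum(s * s for s in sample) * sum(t * t for t in template))
    return dot / norm if norm else 0.0


def mean_abs_error(sample, template):
    """Mean absolute difference of two spectra; lower means more similar.

    The per-channel work is subtraction and addition only (assumed form
    of the report's "mean-error minimization").
    """
    return sum(abs(s - t) for s, t in zip(sample, template)) / len(sample)


def best_match(sample, templates, use_subtraction=False):
    """Return the name of the best-matching template under either measure."""
    if use_subtraction:
        return min(templates, key=lambda n: mean_abs_error(sample, templates[n]))
    return max(templates, key=lambda n: correlation_score(sample, templates[n]))
```

Correlation is insensitive to overall level, while the difference measure is not, which may be one reason the faster method gave less satisfactory results.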

2. Because of its inherently slow running speed, a Fortran program
does not lend itself to real-time operation. With an array processor
the system would run in real time once all programs were converted
into machine language. This is very costly in time and money and was
not pursued, especially since not all the system problems are
satisfactorily solved. At present, the time window is 2.5 seconds.
Because of the limited memory-handling capability of the computer, it
was impractical to process longer utterances. This window
accommodates at most 7 connected digits, enough for a 7-digit
telephone number when spoken fast; at normal rates of speaking, 5 to
6 digits are easy to accommodate. Otherwise the program is able to
handle an unlimited number of words in a string.

3. Speed normalization (rate of speech per unit time) is implemented
by adding a time-warp algorithm to the previous time normalization.
Selected samples in the utterance that correlate with the successive
selected samples above a preset value of the correlation parameter
are skipped. From observations made, this performs reasonably well.
This speed normalization in general requires more samples in the
utterances than in the templates. This is accomplished by sampling
the utterances in the recognition mode more frequently than in the
template-making mode. Templates are made using .92 as the sampling
parameter, while during recognition the data is sampled with .96 as
the sampling parameter. Again the performance is quite satisfactory,
although by going to the .92 value of the sampling parameter some
discrimination loss seems to have occurred.
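
The skipping rule in item 3 can be sketched as follows. This is an illustration under stated assumptions, not the report's program: it assumes the sampling parameter is a threshold on the normalized correlation between a frame and the last frame kept, so that a higher threshold (.96) keeps more samples than a lower one (.92). The function names are hypothetical.

```python
import math


def correlation(a, b):
    """Normalized correlation between two spectral frames."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_samples(frames, threshold):
    """Keep a frame only when it differs enough from the last kept frame.

    Frames whose correlation with the last kept frame exceeds `threshold`
    are skipped, so slowly changing stretches collapse to few samples.
    """
    if not frames:
        return []
    kept = [frames[0]]
    for frame in frames[1:]:
        if correlation(kept[-1], frame) > threshold:
            continue  # too similar to the last kept sample: skip it
        kept.append(frame)
    return kept


# A slowly rotating 2-channel "spectrum" stands in for real filter outputs.
frames = [[math.cos(0.05 * i), math.sin(0.05 * i)] for i in range(30)]
template_samples = select_samples(frames, 0.92)
utterance_samples = select_samples(frames, 0.96)
```

With this toy input the .96 setting keeps more samples than the .92 setting, which matches the report's statement that utterances are sampled more frequently than templates.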

4. After the time-warp algorithm was added, the editing rules had to
be reexamined and the program updated. A good feature was that it
was possible to reduce considerably the extraneous errors (errors of
the type 3 -> 3,8, where 8 is extraneous). As a result the present
recognizer makes fewer extraneous errors.

5. Intensive attempts were made to implement pitch correction. Early
on, a commercial hardware pitch extraction system was purchased and
tried. It extracts the pitch only from long stable regions; it was
not possible to extract it from transition regions and near
voice-unvoice regions, where we needed it most. After some effort it
became clear that it would not work satisfactorily in the intended
way.

6. Originally, the contract required 10 digits. Later the word
"point" was added at the suggestion of the agency for reasons of
flexibility in handling material containing digits. The word "point"
is often confused with some of the digits. This reduces the accuracy
considerably. Without "point" in the data base the accuracies would
be considerably higher. (The largest number of errors comes from
"point" being confused with 1 and 4.)

7. The zero-crossing circuitry, a hardware component, required an
additional channel which was not available in the new machine's
16-channel data acquisition system. As a result, the voice-unvoice
feature became less reliable. The voice-unvoice feature is necessary
in averaging over the skeletons so that selected samples do not get
out of step. An automatic averaging process was tried and the
results were not completely satisfactory. Speaker independence was
then pursued in the direction of using the preamble to identify the
one to three closest speakers in the data base and using the
templates of these speakers for recognition. This procedure is
workable and it is much simpler than the expansion method, both in
the creation of the data base and in updating during the long-time
adaptation process.

8. To achieve speaker and channel independence the concept called
for averaging over large numbers of vowels to obtain the basis
functions and averaging over a large number of skeletons (voiced
samples with voiced, unvoiced samples with unvoiced). Also,
skeletons belonging to templates taken from the hard set should be
averaged among themselves according to the position from which they
are taken. A large vocabulary with mostly multisyllable words
presents a very large number of alternatives, with many
representatives from each position. Averaging by hand and by human
judgement became impossible. Attempts were made to do so
automatically, and this did not produce satisfactory results. The
conclusion here seems to be that careful averaging over a large
number of basis functions is necessary to obtain the average basis
functions, and averaging over a large number of skeletons is
necessary to obtain the average skeletons across speakers. After
this is done, some sort of category formation is also needed to put
similar dialects and accents into separate categories. With the time
available to us after the programs became operational, this was not
possible to carry out.

9. The short-time adaptive procedure based on the preamble "KEY SUE
FUR SHOP" is used to identify the 1-3 speakers closest to the unknown
person's speech. When the templates of these 1-3 speakers are used
for recognition, a short-term adaptation is achieved. Thereafter a
long-term adaptation is applied, during which a few new templates are
added from a hard set of digits (such as 411, 383, etc.) or from
errors that occur as the system performs. Results are close to the
results obtained for single-speaker connected-mode recognition. In
this mode singles usually perform with no errors or very few. The
templates to be added are mostly for the connected digits.

10. The complete statistics presented are a combination of adaptive
recognition results (7 speakers) and single-speaker recognition
results (37 speakers). The errors and their analyses are given
separately in the results section. This summarizes the test results
on 44 speakers (men and women). It is to be stressed that only one
sample of a word was stored from each speaker. Time did not permit
each speaker to train on 10 or more utterances of each word. Taking
into consideration all the limitations under which the work was done,
the end results are quite satisfactory and indicate that a
speaker-adaptable recognizer of connected words is a real possibility
along these lines. With a CRT at his disposal, a good speaker can be
trained within 20-30 minutes to a consistency of 95-98% accuracy in
the connected mode. Some familiarity with speech patterns and with
how to make templates is, however, necessary, since full automation
of the procedure has not been possible at the present time. The
results should be viewed with the consideration that more than half
of these speakers were completely untrained and were unfamiliar with
any kind of recognition task.

11. The procedures developed, both for short-time and long-time
adaptation, are independent of speaker and channel; that is, the
procedures are applicable to both equally well. The filter system
bandwidth is very close to the telephone bandwidth. Because there
was no array processor, nor time to convert the programs into machine
language, the short-time adaptation process uses a few minutes of
calculation time. However, uttering the preamble words takes less
than 10 seconds. With a fast computation scheme the total would be
no more than 10 seconds. The long-time adaptation and updating at
present takes 50-60 minutes because templates are made by hand. With
automatic template making this would take about 60 seconds, depending
on how many of the templates need to be added or updated.

12. The recognition algorithm is independent of the vocabulary
chosen or the phrase to be recognized, since any word or phrase can
be used or added to the vocabulary. It is also independent of the
length or number of syllables in the phrase, since each syllable can
be included in a string of syllables which sequentially represents
the phrase, and the phrase is recognized by this sequential
representation.

13. The present performance and error analysis show the major
sources of errors to be: a) voice-unvoice became less reliable, hence
the large number of errors in "six"; b) the word "point" is confused
with "one" and "four"; c) more than half of the speakers are either
novices or have dialects and accents.

14. In the near future it is planned to redo the basis functions on
the basis of averages of large numbers of individual basis functions
and to transform with these averaged functions. It is also planned
to reintroduce the zero-crossing feature and reexamine the time-warp
algorithm. In this way, the present programs are likely to be made
to work effectively. In the meantime the present adaptive algorithm
should be considered satisfactory since it performs reasonably well
and since it is simple to implement and operate.

1. RECOGNITION EXPERIMENTS

With the above recognition algorithms, several experiments were
performed to test the various speaker-adaptation options. Whereas
the initial efforts were directed toward implementing spectral
adaptation, the predominant direction was speaker categorization.

1.1 Spectral Adaptation.

Attempts were made to perform spectral speaker adaptation (by the
methods discussed in Chapters 2 and 3), first adapting the voiced
parts of the templates with 4 vowel functions, and subsequently using
6 and 8 basis functions (voiced and unvoiced) to adapt all template
samples.

1.1.1 Experiment To Build Up an Adequate Master-Skeleton File
with 4-Vowel Adaptation.


a) Goal. To amass a minimal data base of skeletons, made from a
limited data base of recorded utterances, which is sufficient to
recognize single digits (0-9) when a new speaker enters his/her
basis functions.

b) Basis functions used. Basis functions are 5 instances per speaker
of I, @, A, and U from the utterance "He Had Hot Food" repeated 5
times. Each instance of a vowel consists of 16 filtrates at the
times of two selected samples at the peak of each vowel. Basis
functions are stored as templates in an ordinary template file. They
are selected using a program that edits out all selected samples not
at vowel peaks; this program should be instrumental in automating the
vowel-extraction process.

c) Method of Making Skeletons. For each vowel name (I, @, A, U), a
16-vector basis function is constructed as described in Section
B.6.1. Each unvoiced selected sample in each word-template is stored
directly (without alteration) as the corresponding selected sample of
the skeleton of the word-template. Each voiced selected sample is
expanded into a skeleton of coefficients, which are stored as real
numbers where the 16 byte-filtrates used to be in the template file.
(Since 1 real number = 4 bytes, 4 real numbers fit exactly into 16
bytes.) The skeletons are now devoid of spectra of the speaker's
voice (except for unvoiced regions).

d) Method of Reconstituting Templates. The new speaker's vowel basis
functions are obtained as in Sections B.6.1 and B.6.3. Each unvoiced
selected sample is stored directly as the corresponding selected
sample of the reconstituted word-template. For each voiced selected
sample, the four skeleton coefficients are multiplied by the
respective basis functions of the new speaker. The sum of these four
products (a 16-byte vector) is the reconstituted voiced selected
sample, and is stored where the skeleton coefficients were in the
template file. The reconstituted templates are now endowed with the
voice spectra of the new speaker (except for unvoiced regions).
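
The expansion and reconstitution steps in c) and d) can be sketched as below. The reconstitution (sum of the four coefficient-times-basis products) follows the text directly. The expansion method is defined in Section B.6.1, which is not reproduced here, so a least-squares fit is used as a stand-in assumption; all function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def expand(sample, basis):
    """Skeleton coefficients of a voiced 16-channel sample.

    Assumption: least-squares projection onto the speaker's vowel basis
    functions (normal equations); the report's own method may differ.
    """
    gram = [[sum(p * q for p, q in zip(bi, bj)) for bj in basis] for bi in basis]
    rhs = [sum(p * q for p, q in zip(bi, sample)) for bi in basis]
    return solve_linear(gram, rhs)


def reconstitute(coeffs, basis):
    """Sum of coefficient-times-basis products for the new speaker."""
    channels = len(basis[0])
    return [sum(c * b[k] for c, b in zip(coeffs, basis)) for k in range(channels)]
```

A skeleton made with one speaker's basis functions is reconstituted by calling `reconstitute` with another speaker's basis functions; unvoiced samples bypass both steps and are copied unchanged.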

e) Methods of the Experiments. Basis function template files were
made for 5 speakers (HK, LF, DP, EM, MB), and thirty single-digit
utterances in random order were stored on disk for each of these
speakers. Templates for the single digits (0-9) were made for
speaker HK (from new utterances not on the disk). These templates
were expanded into skeletons, reconstituted with HK's basis
functions, and then run in a recognition against HK's 30 single-digit
utterances. Templates were made from representative uttered digits
missed in the recognition run. (By "representative" is meant that if
several instances of "five" are missed, only one of them is made into
templates.) These templates were expanded into skeletons, added to
the existing skeleton file, and then the whole file was reconstituted
with HK's basis functions. This was repeated until speaker HK had no
errors in the recognition run.

Now a new speaker was introduced. The skeleton file was
reconstituted in terms of LF's basis functions and a recognition was
run on LF's utterances. Skeletons were made from representative
utterances missed in recognition via LF's basis functions. This
process was once again iterated until no errors were made. New
people were added to the data base until no errors occurred in the
next speaker's utterances when run against a reconstituted template
file made only from skeletons from previous speakers (and, of course,
reconstituted via the new speaker's basis functions). Finally, the
skeleton file was recycled over all the data-base speakers to make
sure no errors had accumulated due to competition among reconstituted
templates.
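
The build-up procedure described in e) amounts to a loop over speakers that keeps adding skeletons for missed digits until a recognition run is error-free. A minimal sketch follows; `recognize` and `make_skeleton` are hypothetical callbacks standing in for the report's recognition run (including reconstitution with the current speaker's basis functions) and its template-to-skeleton expansion.

```python
def build_skeleton_file(speakers, recognize, make_skeleton, max_rounds=20):
    """Grow a shared skeleton file speaker by speaker.

    speakers: mapping of speaker name -> list of (utterance, true_digit),
              processed in insertion order.
    recognize(skeletons, name, utterance) -> recognized digit or None.
    make_skeleton(name, utterance, digit) -> new skeleton entry.
    """
    skeletons = []
    for name, utterances in speakers.items():
        for _ in range(max_rounds):
            missed = {}
            for utterance, digit in utterances:
                if recognize(skeletons, name, utterance) != digit:
                    # Only one representative per missed digit is kept.
                    missed.setdefault(digit, utterance)
            if not missed:
                break  # this speaker now runs without errors
            for digit, utterance in missed.items():
                skeletons.append(make_skeleton(name, utterance, digit))
    return skeletons


def recycle_errors(speakers, skeletons, recognize):
    """Final pass over all speakers; returns the total number of errors."""
    return sum(
        recognize(skeletons, name, utterance) != digit
        for name, utterances in speakers.items()
        for utterance, digit in utterances
    )
```

The loop stops adding speakers' skeletons only when each speaker's own run is clean; the final `recycle_errors` pass mirrors the report's check that added skeletons did not introduce errors for earlier speakers.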

f) Results. The sequence of single errors was as follows:

HK - 5 errors initially, removed by alternative 0, 6, 8
LF - 2 errors initially, removed by alternative 2, 3
DD - 3 errors initially, removed by alternative 9, 1, 0
DM - 3 errors initially, removed by alternative 0, 9, 7
MB - no errors; cycle complete
Recycle - no errors.

These results were encouraging, but could not be consistently
obtained in subsequent trials. Errors in such words as "six"
indicated that unvoiced regions also needed spectral adaptation.
Therefore, unvoiced basis functions were introduced in the
experiments that followed. Due to the unreliability of voice-unvoice
mentioned before, a larger number of skeletons from different
speakers could not be averaged meaningfully. Speaker categorization
was therefore implemented on the basis of the one to three closest
speakers.

1.2  Speaker-Categorization  Adaptation. 

Although the tests of the speaker-categorization method of
adaptation (described in detail in Section B.7) may properly be
regarded as experiments, they are not interim experiments that
paralleled and instructed the development of the recognizer. Rather,
they are tests of the completed recognition algorithm, performed in
an "open loop" without further modification of the algorithm.
Therefore, these test results are presented separately in Section 2.

2. RESULTS

The results are presented in two stages. The first stage consists of
the tests that were performed while constructing the data base; the
second stage is the speaker-independent test. The results are
presented in tables showing the performance of each speaker. They
are also analyzed using confusion matrices and normalized error
plots.

2.1 Data Base and Test Material.

The material for the test consisted of strings of digits and command
words recorded for forty-four (44) speakers, thirty-six (36) male and
eight (8) female. Each speaker was given a random list of digits to
read in a connected manner. The list was made in such a way that all
digits had an equal representation but in random combinations. Each
speaker read 30 single digits, 20 utterances of double digits (40
digits) and 30 utterances of triples (90 digits) for a total of 160
digits. In the context of this experiment the digits are the English
digits "zero" through "nine" and the word "point". In addition, for
the purpose of data base generation and speaker adaptation, a second
set of recordings was made for each speaker. This recording
consisted of a set of adaptation utterances, a single repetition of
the digits and a "hard set" of digit utterances. The hard set
consists of a subset of the digits in such combinations that they
present problems due to coarticulation. Each speaker read the same
list (which is presented in Section 3.7).


2.2  Data  Base  Generation. 

The data base was constructed using 29 male and 8 female speakers.
One set of digits, spoken in a discrete fashion, from each speaker
and one set of command words were used. A set of templates, one for
each word, was made for a total of 37 sets of templates. This was
the preliminary data base and was the starting point for the second
stage of data-base updating. The updating was done on the "hard set"
by running a recognition test using the preliminary templates. The
errors were corrected by the addition of templates until the "hard
set" of digits reached satisfactory performance. This procedure was
repeated 37 times until the preliminary set of templates for each
speaker was updated to the point of acceptable performance on the
"hard set" of recordings.

2.3  Performance  of  Data  Base  Speakers. 


To test the performance of the data-base speakers, a recognition run
was performed on the random utterances of each speaker. Since the
template file in each case was that speaker's file, this is a test of
a single-speaker connected-word recognition system. The object of
this test was to evaluate a practical system for the recognition of
the digits, the word "point", and ten other command words when used
in a simulated data-entry environment. Even though the system is a
single-speaker system, it can be considered practical in a
multispeaker environment since only a single repetition of the
vocabulary read in a discrete fashion was used for initial training.

The results are summarized in Tables 1 through 4. The wide variation
in performance is due to the fact that most speakers (31 out of 37)
had never talked to a speech recognition system before and tended to
slur their words when reading connected strings. Table 1 is arranged
in descending order of performance. It is clear that for the top 20
speakers the performance is substantially higher than for the
remaining 17 speakers, 97.3% and 86.0% respectively. Tables 3 and 4
summarize the confusion matrices showing the errors and the
contribution of the individual words to the total number of errors.
Each confusion matrix contains the total number of errors per word,
the number of extraneous words and the number of rejections per word.
In Tables 2 through 4, a "?" means the word was missed, an "ex" means
the word was printed as an extra word and the "P" denotes the word
"point". The words that contributed most to the errors were "six",
"eight", and "point". This is true for the top 20 speakers as well
as the bottom 17 speakers, indicating that the voice/unvoice
categorization failed for both categories of speakers.

2.4 Recognition of the Control Words.

A set of 10 control words was entered into the data base using the
same 37 speakers. The words enter, retrieve, continue, plus, recall,
minus, mistake, backspace, reset, and stop were stored as templates.
The templates were based on a single repetition of these words by
each of the 37 data base speakers and no additional corrections were
used. To test the performance, each

No.   Speaker   Sex   No. of Errors   % Correct

 1    RMA       M          0           100.0
 2    IFE       M          0           100.0
 3    GAM       M          0           100.0
 4    LFE       M          2            98.8
 5    DCO       M          2            98.8
 6    AIX)      M          2            98.8
 7    BKE       M          3            98.1
 8    BBF       M          4            97.5
 9    IDE       M          4            97.5
10    .JMLJ     F          4            97.5
11    MBR       M          5            96.9
12    EMC       M          6            96.3
13    HNA       M          6            96.3
14    MKD       M          6            96.3
15    TSI       M          6            96.3
16    jive      M          7            95.6
17    HKE       M          7            95.6
18    SKE       F          7            95.6
19    OCA       F          7            95.6
20    WSA       M          8            95.0
21    AFE       M         14            91.3
22    FID       M         15            90.6
23    RDA       M         15            90.6
24    »1C       M         15            90.6
25    BKE       M         17            89.4
26    T5TT      M         17            89.4
27    KOU       F         18            88.8
28    HYI       M         19            88.1
29    SE\’      M         19            88.1
30    RHA       M         19            88.1
31    EMA       M         20            87.5
32    PTE       F         20            87.5
33    NBI       F         26            83.8
34    FKD       M         30            81.3
35    SGR       M         32            80.0
36    KSVr      F         40            75.0
37    ECU       F         44            72.5

TABLE 1.

A summary of recognition results for a vocabulary of 11 words, the
digits and the word "point". The table is arranged in order of
decreasing performance.
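
The group accuracies quoted in the text can be cross-checked against the per-speaker error counts. The sketch below is an arithmetic check, not part of the original report: it takes the error counts of Table 1, using the count implied by the percent-correct column wherever a count is illegible or missing, and assumes 160 test words per speaker as stated in Section 2.1.

```python
WORDS_PER_SPEAKER = 160

# Error counts for speakers 1-20 and 21-37 of Table 1.
top_20 = [0, 0, 0, 2, 2, 2, 3, 4, 4, 4, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8]
bottom_17 = [14, 15, 15, 15, 17, 17, 18, 19, 19, 19, 20, 20, 26, 30, 32, 40, 44]


def percent_correct(errors):
    """Percent of words recognized correctly for a group of speakers."""
    total_words = WORDS_PER_SPEAKER * len(errors)
    return 100.0 * (total_words - sum(errors)) / total_words


trained = round(percent_correct(top_20), 1)
untrained = round(percent_correct(bottom_17), 1)
overall = round(percent_correct(top_20 + bottom_17), 1)
print(trained, untrained, overall)
```

The three values agree with the 97.3%, 86.0% and 92.1% figures quoted for the top 20 speakers, the remaining 17 speakers, and all 37 speakers (5920 words).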

Table 2.

Confusion matrix for 37 speakers, male and female, showing the errors
for connected digits and the word "point". A total of 5920 words were
used in strings of 1, 2, or 3 words per string. Overall accuracy,
including errors from all sources, was 92.1%.

Table 3.

Confusion matrix for 20 trained speakers. The connected digits plus
the word "point" were spoken in strings of 1, 2, or 3 words, 160 words
per person. Overall accuracy, including errors of omission,
commission and extraneous words, is 97.3%.

Table 4.

Confusion matrix for 17 untrained speakers. The connected digits and
the word "point" were spoken in strings of 1, 2, or 3 words, 160 words
per person. Overall accuracy, including errors of omission,
commission and extraneous words, is 86.0%.

of the speakers read the words one word at a time in random order.
In each case the templates used were that speaker's own templates. A
summary of the results is shown in Table 5. The results are based on
20 words per person for 37 speakers for a total of 740 words. There
were a total of 15 errors, 13 of which were rejection errors, for a
score of 98.0% correct recognition.

2.5  Speaker  Adaptation  Test,  Using  Categorization. 

To evaluate the performance of the system for a set of speakers that
were not in the data base, the following procedure was used. Each
speaker reads the preamble "KEY SUE FUR SHOP". The system performs a
correlation of the preamble with all preambles of the 37 data base
speakers. The highest scoring speaker or speakers are chosen as
representatives and their templates are used as a reference
vocabulary for recognition purposes. The algorithm will select 1, 2
or 3 categories depending on a distance measure among the top 3
candidates in the data base. In the present system each of the 37
speakers represents a category. This is suboptimal, since by
cross-correlation of the data base a set of categories can be found
which will reduce the number of categories and make them more
representative. The test was performed for seven speakers: six
speakers from PTC and one speaker from RADC. The RADC speaker
(speaker RV) was tested as part of the final demonstration.
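
The category selection step can be sketched as follows. The report states only that 1, 2 or 3 categories are chosen "depending on a distance measure among the top 3 candidates"; the specific rule used here (keep any of the top three whose preamble score lies within a fixed margin of the best score) and the `margin` value are assumptions made for illustration.

```python
def select_categories(scores, margin=0.02, max_categories=3):
    """Pick the 1 to 3 data-base speakers closest to the new speaker.

    scores: mapping of data-base speaker name -> preamble correlation
            with the new speaker (higher means closer).
    margin: hypothetical tolerance; runners-up are kept only if their
            score is within this distance of the best score.
    """
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:max_categories]
    best_score = top[0][1]
    return [name for name, score in top if best_score - score <= margin]


# Example with made-up correlation scores for four data-base speakers.
chosen = select_categories({"A": 0.95, "B": 0.94, "C": 0.80, "D": 0.90})
```

The templates of the returned speakers would then be pooled as the reference vocabulary for the new speaker.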


The  results  are  shown  in  Table  6.  After  the  selection  of  a 
category  the  test  speaker  reads  the  "hard  set"  of  digits.  When  an 


Table 5.

Confusion matrix for the command words showing the errors for a data
base of 37 speakers. There were 15 errors out of 740 words, for 98.0%
correct recognition. [Confusion matrix not reproduced.]

error occurs, the templates in that category are augmented to
eliminate that error. The number of errors in the "hard set" for
each speaker is shown in the second column of Table 6. This is also
the number of additional templates that were added to that category.
The speaker then reads the list of test utterances. The number of
errors and % correct recognition (columns 3 and 4) indicate
performance after the "hard set" augmentation. The test material was
based on data recorded for the seven speakers following the same
procedure and the same data structure as the material for the other
37 speakers.


Speaker   No. Errors    No. Errors   % Correct
          "Hard Set"    Test Set     Recognition

UK            13            7           95.6
MB             3            7           95.6
AD             2            2           98.8
LF            16            8           95.0
FK            15            5           96.9
HM             7            6           96.3
RV             9            6           96.3

Table 6.

Results for 7 Speakers Using the Adaptation by Category Method.

3. CONCLUSIONS.

The results of the present investigation indicate that speaker
adaptation by way of category formation can be made to work with a
combination of short-time and long-time adaptation. During the
short-time adaptation, a preamble not containing the vocabulary words
is used to select a subset of templates from the overall data base.
The new speaker usually performs with very few errors on singles. On
doubles and triples, errors are higher. During the long-time
adaptation, a few of the templates, either from the hard set of
connected words or from the positions of errors, are corrected. At
the end of this process the speaker performs close to his
single-speaker performance. With the short-time adaptation an
operator can start using the machine and, as time goes on, with the
long-time adaptation, can bring his performance to 95% - 98.8%
accuracy. The latter process was carried out for 7 male speakers.
It is noteworthy that for these 7 speakers the single digits plus
"point" performed with 99.05% correct recognition.
APPENDIX A.

A. CONCEPTUAL BACKGROUND

A.1 Color Analogue in Spectral Adaptation to Speech.

Our algorithm for speaker-independent speech recognition
begins by addressing a more specific problem, that of removing the
speaker dependence from the sound-energy spectrum of a steady-state
vowel. The problem has been discussed from a psychophysical point
of view by Yilmaz (1967, 1968), and from a pattern-recognition
point of view by Yilmaz et al. (1976).

From a standard acoustical argument, a vowel's sound-energy
spectrum can be expressed as the product of an energy spectrum $I(\lambda)$
from the vocal apparatus (including larynx and cavity resonators)
and a modulating spectrum $R(\lambda)$ from the articulators. We
hypothesize that most of the speaker dependence of the vowel resides in
the vocal apparatus, and the modulating spectrum conveys the identity
of the vowel.

Thus the problem of representing spoken vowels in a
speaker-independent way becomes analogous to that of recognizing object
spectral reflectances independently of the spectrum of the incident
light. The vocal-apparatus sound-energy spectrum $I(\lambda)$ (an
excitation function) is analogous to the illuminant energy spectrum, and
the modulation $R(\lambda)$ from the articulators is analogous to
reflectance. The analogy is particularly clear when we view selective
reflection of light from a surface as two-way (entering and leaving)
transmittance of the light through the translucent layer constituting
the surface. Viewing spectral reflectance as a transmittance
clarifies the physical analogy with the selective-transmittance
properties of the articulators in the vocal tract.

As is the case with reflectance spectra, we also hypothesize
that a typical articulator modulating spectrum tends to be slowly
varying in wavelength, so it can be approximated by an expansion
in terms of four basis functions $r_k(\lambda)$*:

$$R_i(\lambda) = \sum_{k=1}^{4} \alpha_{ik}\, r_k(\lambda) \tag{2-1}$$

The same speaker-independent coefficients $\alpha_{ik}$ (called the
skeleton of the vowel) characterize the expansion of the vowel-sound
filtrates through the 16 filters $q_j(\lambda)$ of the PTC recognizer:

$$F_{ij}(I) = \sum_{k=1}^{4} \alpha_{ik}\, f_{kj}(I) \tag{2-2}$$

where $F_{ij}(I) = \int q_j\, I\, R_i\, d\lambda$ and $f_{kj}(I) = \int q_j\, I\, r_k\, d\lambda$.

The recognition proceeds as follows: Calculate the skeleton of
vowel i using the filtrates $F_{ij}(I)$ for this vowel and the filtrates
$f_{kj}(I)$ for the basis vowels k, all spoken by a control speaker I.
Then have a new speaker J say a preamble in which the basis-vowel
filtrates are identified and recorded, and construct an estimate
of vowel i said by the new speaker. In order for a candidate vowel
spoken by J to be recognized as the vowel i, its filtrate 16-vector
(later to be reduced to 4 independent expansion functions) must
match the estimate for vowel i obtained by the above adaptive
method. In this way, an arbitrary vowel spoken by a new speaker can
be recognized without a prior utterance of that vowel by the new
speaker.

* The number 4 was arrived at by finding empirically that four
expansion functions are sufficient to construct intelligible
speech in a vocoder; further expansion functions add only
prosodic qualities to the reconstructed signal.

In the above procedure, it is necessary to solve Equations
(2-2) to find the skeleton $\alpha_{ik}$ of vowel i. One needs only four
of these 16 equations to find the skeleton: the system is
overdetermined. We resolved this ambiguity by computing the $\alpha_{ik}$
that rendered a least-squares best fit of $\sum_k \alpha_{ik} f_{kj}(I)$ to
$F_{ij}(I)$ (i.e., we minimized

$$\sum_{j=1}^{16} \Big[ F_{ij}(I) - \sum_{k=1}^{4} \alpha_{ik}\, f_{kj}(I) \Big]^2 \tag{2-3}$$

by the standard method.)

The least-squares method involves solving for $\alpha_{ik}$ the
equations

$$\sum_{j=1}^{16} F_{ij}\, f_{mj} = \sum_{k=1}^{4} \alpha_{ik} \sum_{j=1}^{16} f_{kj}\, f_{mj}, \qquad m = 1, \ldots, 4,$$

which is equivalent to using as audio response functions the basis
functions $f_{kj}(I)$ of the speaker in question. If these basis
functions are spiky in their spectra, however, small errors in their
assessment will be magnified by letting them play the role of
response spectra in determining the skeleton. The computed
skeletons may be more stable with respect to basis-function errors if
we use smoothly varying orthogonal functions as the
response functions of the system. For example, in
place of $f_{mj}$ in the above equation, we might insert the constant 1
and low-order sine and cosine functions of the filter index j.
These functions are similar to the response
functions in the color theory that motivated this approach.
(Orthogonality is not absolutely necessary in the theory, but aids the
computational accuracy.)
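
The least-squares fit described above can be sketched in a few lines. This is a minimal illustration, not the original PDP 11/70 code: the function names `solve_linear` and `skeleton` are hypothetical, the normal equations are solved by plain Gaussian elimination, and the basis functions are assumed to be linearly independent.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Assumes the matrix a is square and non-singular."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def skeleton(filtrates, basis):
    """Least-squares coefficients alpha_k minimizing
    sum_j (filtrates[j] - sum_k alpha_k * basis[k][j]) ** 2."""
    count = len(basis)
    # Left side of the normal equations: sum_j f_kj * f_mj
    gram = [[sum(x * y for x, y in zip(basis[m], basis[k]))
             for k in range(count)] for m in range(count)]
    # Right side of the normal equations: sum_j F_j * f_mj
    rhs = [sum(x * y for x, y in zip(filtrates, basis[m]))
           for m in range(count)]
    return solve_linear(gram, rhs)
```

Each entry of `basis` is one 16-component basis-function spectrum and `filtrates` is the 16-vector of the vowel being analysed; the result is that vowel's skeleton.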

A.2 Assumptions Particular to Speech.

So far, we have assumed that the speaker-dependent driving
function $I(\lambda)$ is constant in time for a single speaker.
However, since loudness can change from one word to the next in a
speaker's utterance, it is more reasonable to relax this assumption
so that the driving function can vary in amplitude with time, but
remains constant in relative spectral composition. Thus for two
speakers I and J, $I(\lambda,t) = g_I(t)\,I(\lambda)$ and
$J(\lambda,t) = g_J(t)\,J(\lambda)$. (Note
that even this relaxed assumption does not yet take into account
pitch variations within a single speaker's utterance.)

Suppose $t_{Ik}$, $t_{Jk}$ are the times at which vowel k is extracted
from the utterances of speakers I and J. Similarly, let a test
vowel V be uttered by the two speakers at times $t_{IV}$, $t_{JV}$,
respectively. Then the filtrates (j = 1, ..., 16) measured by the PTC
recognizer for vowel V will be

$$F_{Vj}(I) = \sum_{k=1}^{4} \alpha_{Vk}\, \frac{g_I(t_{IV})}{g_I(t_{Ik})}\, f_{kj}(I) \tag{2-4}$$

and similarly for speaker J with I replaced by J.


Without access to the energy envelopes $g_I(t)$ and $g_J(t)$, one
cannot directly infer the skeleton coefficients $\alpha_{Vk}$ from the
filtrates $F_{Vj}(I)$, $F_{Vj}(J)$, $f_{kj}(I)$, $f_{kj}(J)$. One gets only the
quantities

$$\beta_{Vk}(I) = \alpha_{Vk}\, \frac{g_I(t_{IV})}{g_I(t_{Ik})} \tag{2-5}$$

and similarly for speaker J.


We cannot solve this problem by artificially normalizing the
filtrates in each 16-vector spectrum (as in peak normalization),
for the normalization factors will not generally compensate for
the loudness changes. (After all, the envelope to be normalized
is on the driving function, not on the peak amplitudes of the
product spectra corresponding to the basis vowels.) Theoretically, we
could resort to finding the envelope ratio between speakers directly
by taking the ratio between basis-function spectra corresponding to
the same vowel name from the two speakers. However, the PTC
filters are fairly broad-band compared to the spectral transitions of
characteristic speech sounds at any given time. Thus such a ratio
can be done only on paper, not by the machine. Making the PTC
filters narrow-band will only degrade their time resolution, so
this problem is significant and not an accidental property of the
present recognizer. Thus we have to adopt an approach that allows
the 16 PTC filters to sum the spectrum before further processing is
done on the resulting filtrates. The basis-function approach can
be made to do this as follows:

Define a fiducial spectrum for each speaker corresponding to a
known vowel that is not one of the basis functions. Denote it by
subscript 0, so that

$$F_{0j}(I) = \sum_{k=1}^{4} \alpha_{0k}\, \frac{g_I(t_{I0})}{g_I(t_{Ik})}\, f_{kj}(I) \tag{2-6}$$

and similarly for speaker J.


Then, as before, one can infer by least-squares best fit the
quantities

$$\beta_{0k}(I) = \alpha_{0k}\, \frac{g_I(t_{I0})}{g_I(t_{Ik})} \tag{2-7}$$

and similarly for speaker J.

Before we proceed further, we note that each of the spectra we
are approximating will be peak-normalized (scaled so the maximum
component is 255) before being compared to similar spectra in an
unknown utterance. Therefore, we can without loss of generality
assume any factor we want to scale the spectra $F_{Vj}(I)$, $F_{Vj}(J)$. In
particular, we can assume

$$\frac{g_I(t_{I0})}{g_I(t_{IV})} = 1 = \frac{g_J(t_{J0})}{g_J(t_{JV})}$$

so that

$$\frac{\beta_{Vk}(I)}{\beta_{0k}(I)} = \frac{\beta_{Vk}(J)}{\beta_{0k}(J)} = \frac{\alpha_{Vk}}{\alpha_{0k}} \tag{2-8}$$

is speaker-independent.

Suppose now that we use the utterances of speaker I to compute
$\beta_{0k}(I)$, $\beta_{Vk}(I)$. Then, after finding $f_{kj}(J)$, $\beta_{0k}(J)$ for a new
speaker J, we can predict (up to a scaling factor) the spectrum
for V spoken by J from the relation:

$$F_{Vj}(J) = \sum_{k=1}^{4} \frac{\beta_{Vk}(I)}{\beta_{0k}(I)}\, \beta_{0k}(J)\, f_{kj}(J) \tag{2-9}$$


This is the theory of speaker-independent vowel reconstruction so
far as we see it. The $\beta_{0k}(I)$ is introduced to compensate for
intensity variations during basis-function extraction. In the process
of optimizing the recognition performance of the word recognizer,
it has been expedient to include unvoiced sounds as well as vowels
in the expansion; we settled on the sounds I, E, A, U, S, $ for
later experiments, but earlier experiments just used I, 0, A, U to
reconstitute voiced sounds. In all experiments, basis functions
were obtained from spectra averaged over 3 to 5 utterances;
this should be extended to at least 10 utterances for
greater smoothness of the basis functions.
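
The prediction step of the theory can be illustrated with a short sketch. It assumes that the per-basis quantities for the control speaker and the new speaker have already been obtained by least-squares fits; the function name `predict_spectrum` and its argument names are hypothetical, and the final scaling to a peak of 255 follows the peak normalization mentioned above.

```python
def predict_spectrum(beta_v_i, beta_0_i, beta_0_j, basis_j, peak=255):
    """Predict the spectrum of vowel V for new speaker J.

    beta_v_i : fitted coefficients of vowel V for control speaker I
    beta_0_i : fitted coefficients of the fiducial vowel for speaker I
    beta_0_j : fitted coefficients of the fiducial vowel for speaker J
    basis_j  : basis-function spectra of speaker J (one list per basis)
    """
    # The ratio beta_V / beta_0 is taken as speaker-independent,
    # so it is carried over from speaker I to speaker J.
    coeffs = [bv / b0i * b0j
              for bv, b0i, b0j in zip(beta_v_i, beta_0_i, beta_0_j)]
    width = len(basis_j[0])
    spectrum = [sum(c * f[j] for c, f in zip(coeffs, basis_j))
                for j in range(width)]
    top = max(spectrum)
    if top <= 0:
        raise ValueError("predicted spectrum has no positive component")
    # Peak-normalize so the largest component equals `peak`.
    return [peak * s / top for s in spectrum]
```

Because the result is peak-normalized, any overall scale factor in the fitted coefficients drops out, which is why the prediction is only needed up to a scaling factor.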

We are now developing a version of the recognition algorithm
that incorporates the basis-function normalization via $\beta_{0k}(I)$. We
believe that incorporating our understanding of basis-function
normalization will significantly improve the recognition results
on adapted word templates. In the future, we will also average
the skeletons from different utterances to remove any
residual speaker dependences. Up to now, we have had difficulty
performing such averaging because the voice/unvoice detector
could not be relied upon to determine the voice/unvoice boundaries from
which selected speech samples could be sequentially averaged.

APPENDIX B.

B. METHODS OF IMPLEMENTATION

In this section, we shall describe the operation of the
recognition system developed and tested by Perception Technology
Corporation.

B.1 Initial Signal Processing.

The system prepares all incoming signals for further
processing in the manner described below:

B.1.1. Fixed Interval Samples.

The system is designed to process an utterance
of duration 2.5 seconds at each entry. The length of this time
span is imposed only by the capacity of the computer. During the
allowed 2.5 seconds, approximately five single-syllable connected
utterances can be entered. The signal is passed through a bank of
16 weighted analog filters covering the range of the audio spectrum.

A description of the characteristics of these filters is given in
Section B.9. The system is triggered manually or automatically
before each entry. Upon triggering, the system starts taking
readings of the 16 filters every 10 milliseconds, giving a maximum of
225 fixed interval samples. (Automatic triggering in a long
utterance proceeds in 2.5 second pieces consecutively staggered
backwards by 0.5 seconds to encompass vocabulary words eclipsing
the end of each time window.)

Following each reading of the 16 filters, calculations are
performed on the data during the 10 millisecond interval before
another set of readings is taken. Two quantities are obtained as
follows:

a) $A_i$, sum of filter amplitudes, for sample number i:

$$A_i = \sum_{j=1}^{16} f_{ij}$$

where $f_{ij}$ is the amplitude of the output from the jth filter when
sample i was taken. Preliminary energy normalization scales all
the energies so that the maximum over a 2.5 second window is set
equal to 5000 (a maximum convenient for integer storage of the
energy values). Referring to Figure 1A, i denotes the sequential
numbers in the left-most column and $A_i$ is represented by the
position of the symbol *.

b) $R_i$, square root of the sum of squares of $f_{ij}$, is given by

$$R_i = \Big( \sum_{j=1}^{16} f_{ij}^2 \Big)^{1/2}$$

$R_i$ is needed for subsequent correlation calculations. The numbers
$f_{ij}$, $A_i$, $R_i$ are all stored in memory.
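
The two per-sample quantities of items a) and b), together with the preliminary energy normalization, can be sketched as follows. The function name `frame_measures` is hypothetical, and rounding the scaled energies to integers is an assumption suggested by the remark about integer storage.

```python
import math


def frame_measures(frames, ceiling=5000):
    """Return (scaled energies A_i, norms R_i) for a window of frames.

    frames  : list of filter readings, one list of amplitudes per sample
    ceiling : value the largest energy in the window is scaled to
    """
    amplitudes = [sum(frame) for frame in frames]                   # A_i
    norms = [math.sqrt(sum(x * x for x in frame)) for frame in frames]  # R_i
    peak = max(amplitudes)
    scale = ceiling / peak if peak > 0 else 0.0
    scaled = [int(round(a * scale)) for a in amplitudes]
    return scaled, norms
```

In the system described here each frame would hold 16 filter amplitudes and the window would hold the samples of one 2.5 second entry.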

B.1.2. Noise Level and Detection of Beginning and End
of Signal.

Silence portions are monitored for the purpose
of determining the noise level. The noise level is taken to be the
time average of noise energy in the channel during the absence of
speech. The threshold level, $V_t$, for a signal is defined as 1.5
times the average energy of the first ten noise samples:

$$V_t = 1.5 \Big[ \sum_{i=1}^{10} \sum_{j=1}^{16} f_{ij} \Big] / 10$$
During microphone input, the program notes the first
superthreshold sample in the 2.5-second window, and continues processing
until a sample with a subthreshold energy occurs. If this sample
is less than 6 samples after the first superthreshold sample, the
program looks for another first superthreshold sample. Otherwise,
the program marks the first subsequent instance of a superthreshold
sample followed by .4 seconds of subthreshold energy. At this point,
the samples extending from .2 seconds prior to the first
superthreshold sample to .4 seconds after the final superthreshold sample are
included in the utterance to be processed subsequently.

As an example, in Figure 1A, the beginning sample number is 57
and the ending sample number is 214. The whole utterance to be
processed lies between these two samples and can contain many
syllables.
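
The endpoint rule can be sketched as below. This is an illustrative reading of the description, not the original program: the function name `find_endpoints` is hypothetical, "superthreshold" is taken to mean strictly above the threshold, and the returned indices are clipped to the window.

```python
def find_endpoints(energy, frame_ms=10, min_run=6, lead_s=0.2, tail_s=0.4):
    """Return (first, last) sample indices of the utterance, or None.

    energy : per-sample summed filter energy; the first ten samples
             are assumed to contain only noise.
    """
    threshold = 1.5 * sum(energy[:10]) / 10
    lead = int(round(lead_s * 1000 / frame_ms))   # 0.2 s in samples
    tail = int(round(tail_s * 1000 / frame_ms))   # 0.4 s in samples
    n = len(energy)
    i = 0
    while i < n:
        if energy[i] <= threshold:
            i += 1
            continue
        start = i
        while i < n and energy[i] > threshold:
            i += 1
        if i - start < min_run:
            continue  # burst too short: look for another onset
        # Track the last superthreshold sample until `tail` quiet
        # samples in a row have been seen (or the window ends).
        last, quiet, k = i - 1, 0, i
        while k < n and quiet < tail:
            if energy[k] > threshold:
                last, quiet = k, 0
            else:
                quiet += 1
            k += 1
        return max(0, start - lead), min(n - 1, last + tail)
    return None
```

With 10 ms samples, the 0.2 s lead and 0.4 s tail correspond to 20 and 40 samples respectively.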

B.1.3. Voiced and Unvoiced Classification.

Contained among the fixed interval samples are
those representing vowels, consonants and gaps, each of which is
to be classified as either voiced or unvoiced. A sample i is
classified as unvoiced if the following two conditions are
simultaneously satisfied:

a) The energy $A_i$ of the sample is less than 32% of the
maximum energy in the 2.5 second utterance.

b) The ratio of the summed energies of the 4 lowest-frequency
filters to the summed energies of the 4 highest-frequency
filters is less than or equal to .6.

This algorithm is the software replacement of the zero-crossing
criterion used in the previous contract.
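
The two-condition test can be written as a small predicate. The sketch below assumes that each frame lists the filter outputs in order of increasing centre frequency, which the text does not state explicitly; the function name `is_unvoiced` is hypothetical.

```python
def is_unvoiced(frame, energy, max_energy,
                energy_fraction=0.32, ratio_limit=0.6):
    """Classify one sample as unvoiced when both conditions hold.

    frame      : filter amplitudes, lowest-frequency filter first
    energy     : summed energy A_i of this sample
    max_energy : largest A_i in the utterance window
    """
    low = sum(frame[:4])     # four lowest-frequency filters
    high = sum(frame[-4:])   # four highest-frequency filters
    weak = energy < energy_fraction * max_energy
    # low / high <= ratio_limit, written without a division so that
    # a frame with no high-frequency energy cannot divide by zero.
    hissy = low <= ratio_limit * high
    return weak and hissy
```

A sample failing either condition is treated as voiced.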
B.1.4. Peak Normalization.

In connected speech, the voiced maxima have
varying amplitudes even within a 2.5 second window, reflecting
changes in loudness as the speaker talks. In order to have a
meaningful index of the rate of change of energy, we
peak-normalized each voiced region longer than 6 time-samples. The
normalization is a quadratic function $G(A_i)$ of the energy such that the
voiced maximum $A_{max}$ maps (always up) into 5000, and also

$G(0) = 0$ (Zero energy is preserved.)

$G'(0) = 1$ (Attack and decay rates are independent of the
utterance's peak loudness, when evaluated near the beginning and
end of the peak, at energy minima of nearly zero energy. This
reflects a natural property of speech sounds.)

Subject to these constraints, the normalization mapping is

$$G(A_i) = A_i + \frac{A_i^2}{A_{max}} \Big( \frac{5000}{A_{max}} - 1 \Big)$$
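
As a quick check, the quadratic mapping can be evaluated directly. The function name `peak_normalize` is hypothetical; the formula is the unique quadratic satisfying the three stated constraints.

```python
def peak_normalize(a, a_max, target=5000):
    """Quadratic map G with G(0) = 0, G'(0) = 1 and G(a_max) = target."""
    return a + (a * a / a_max) * (target / a_max - 1)
```

For a voiced region whose maximum energy is 2000, the peak maps to 5000 while energies near zero are left almost unchanged.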
B.2 Selection of Normalized Samples.

The final samples are selected from the fixed-interval
samples as follows:

B.2.1. Time Normalization.

The basic idea of time normalization is to
sample the speech so as to render it more or less independent of
the rate of speaking. This is usually done in a crude way by
taking the samples only after a significant amount of spectral
change occurs. This, however, is not sufficient because the
intensity, the rate of change of intensity, voicing, etc. are part
of the recognition criteria. The time normalization was therefore
improved by including the following factors:

a) The amount of change in spectral shape:
The amount of spectral change between two samples i and i'
is measured by their correlation $C_{ii'}$.

b) The change in energy:
The change in energy between two samples i and i' is
measured by the absolute value $|A_{i'} - A_i|$.

c) The nature of the signal, i.e., voiced or unvoiced.

d) The level of signal energy.

e) The duration of the signal.

B.2.2. The Use of Parameters.

Parameters are used to associate the factors
mentioned above with a weight and they can be readily employed to
evaluate the relative influence of each factor. The voicing
parameter is used to label the voiced samples as 1 and unvoiced samples
as -1. In the present sampling procedure we have utilized spectral
correlations and intensity changes to measure changes starting from
the first sample that exceeds the threshold. All these variables
are monitored sequentially and the combined correlations are
calculated through the use of these parameters. When the combined
correlations drop below a preset criterion, say 0.96, a sample
is selected. Starting from the selected sample this process is
repeated until the whole utterance is exhausted.

The combined spectral correlations and intensity changes
between two samples i and i' are calculated as follows:

$$\text{Combined correlations} = C_{ii'} - D_c \cdot |A_{i'} - A_i|$$

where $D_c$ is the parameter weighing the relative importance of
intensity change with respect to spectral change. In general, in an
utterance of three connected numerals, the total number of such
samples varies between 30 and 40. The numbers are nearly
independent of time but vary with habit and dialect. In Figure 1A, the
column labelled N gives the normalized samples selected by the
system. The dashed lines are proportional to $C_{ii'}$ for i = 60, the
sample with the peak energy.
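
The selection loop can be sketched as follows. Two points are assumptions rather than statements from the text: the correlation is taken to be the normalized dot product of two filter vectors (suggested by the stored norms $R_i$), and the first superthreshold sample is itself counted as selected. The function names are hypothetical.

```python
import math


def correlation(u, v):
    """Normalized dot product of two filter vectors (assumed form)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0


def select_samples(frames, energies, d_c, criterion=0.96):
    """Return indices of the selected (time-normalized) samples.

    frames   : filter vectors, starting at the first superthreshold sample
    energies : normalized energies A_i for the same samples
    d_c      : weight of the energy change relative to spectral change
    """
    selected = [0]
    ref = 0  # index of the most recently selected sample
    for i in range(1, len(frames)):
        combined = (correlation(frames[ref], frames[i])
                    - d_c * abs(energies[i] - energies[ref]))
        if combined < criterion:
            selected.append(i)
            ref = i  # restart the comparison from the new sample
    return selected
```

A steady sound therefore contributes few selected samples however long it lasts, while a rapid spectral or energy change contributes many.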

B.3 Method of Performing Matches Between Templates and an
Unknown Utterance.

Matches to a template are found by "walking" the template
through the unknown utterance and finding the best correlation of
selected samples in the template with selected samples in the
utterance. The best correlation for a given tentative position of
the first selected sample in the template is found by a method of
time-warping, as follows (see Fig. 2): The tentative position
of the template match in the utterance is marked by the first
selected sample of the template. The second template sample
chooses the best-correlating of the next two utterance samples,
and latches onto it. The third template sample chooses between
the two utterance samples following the utterance sample chosen
by the second template sample. The matching proceeds in this way
until the last sample of the template is matched, and then a
cumulative correlation is computed with the chosen samples in the
utterance. This is the template score.* Since the piece of the
utterance that matches the template always has more selected
samples (possibly as many as twice the number in the template) it is
necessary to choose fewer selected samples in the template than in
the corresponding utterance: the sampling speed must be reduced
for template-making.

At each selected sample in the utterance, the names and
scores of the three best-matching templates starting at that
selected sample are stored, together with the corresponding lengths
of the matching utterance in selected samples. Template scores
that are below a template-dependent threshold are reset to zero.
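
The time-warping step can be sketched as a greedy walk. The sketch makes three assumptions that the text leaves open: the correlation is a normalized dot product, the cumulative correlation is the mean of the per-sample correlations, and the score is scaled by 200. The function names are hypothetical.

```python
import math


def correlation(u, v):
    """Normalized dot product of two filter vectors (assumed form)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (math.sqrt(sum(a * a for a in u))
            * math.sqrt(sum(b * b for b in v)))
    return dot / norm if norm else 0.0


def match_template(template, utterance, start):
    """Greedy time-warped match of a template beginning at `start`.

    Returns (score, matched length in utterance samples), or None if
    the utterance runs out before the template is exhausted.
    """
    if start >= len(utterance):
        return None
    pos = start
    scores = [correlation(template[0], utterance[pos])]
    for sample in template[1:]:
        # Each template sample picks the better of the next two
        # utterance samples and latches onto it.
        candidates = [p for p in (pos + 1, pos + 2) if p < len(utterance)]
        if not candidates:
            return None
        pos = max(candidates, key=lambda p: correlation(sample, utterance[p]))
        scores.append(correlation(sample, utterance[pos]))
    return 200 * sum(scores) / len(scores), pos - start + 1
```

Because each step may skip at most one utterance sample, the matched stretch of the utterance is at most about twice as long as the template, in line with the remark above.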

B.4 Editing Rules for Evaluating Word Matches.

The editing program stitches together the above template matches
to produce word identifications. To do this, it first scans the
template-matching data in temporal order until it finds template matches
that are appropriately ordered and spaced (within a tolerance of one
selected sample) and that constitute a likely instance of a vocabulary
word. A word score is computed as the mean of the (1 to 6) constituent
template scores, and word scores below a word-dependent threshold are
reset to zero. After this, the program looks for later instances of any
vocabulary word, but skips the part of the utterance already matched
by the sequence of templates in the completed identification.

* Actually, all scores are correlations multiplied by 200.

Because this stitching algorithm favors matches at the
beginning of the utterance that eclipse later, higher matches, a
correction called creep is introduced. This delays finalizing a word
identification until the program scans downstream one-third the
length of the utterance segment corresponding to the tentatively
identified word. If it finds a higher score, the new word
preempts the old one; this process is iterated until the end of the
utterance, if necessary. (See Fig. 1B for a sample recognition plot.)

For  the  vocabulary  of  the  present  contract,  we  often  found  a 
spurious  "eight"  riding  in  the  wake  of  identified  words  such  as 
"three".  To  eliminate  such  errors  (facilitated  by  the  shortness 
of  the  typical  "eight"  template)  all  putative  "eights"  had  to 
satisfy  one  of  the  following  conditions: 

a)  The  word  score  is  greater  than  180. 

b) The energy increases sometime during the putative "eight".

c)  The  energy  of  at  least  one  selected  sample  exceeds  2/5  of 
that  of  the  nearest  voiced-peak  maximum. 
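
The three acceptance conditions for a putative "eight" can be combined in one predicate. In this sketch the function name `accept_eight` and its arguments are hypothetical, and "the energy increases" is read as any rise between consecutive selected samples.

```python
def accept_eight(word_score, energies, nearest_voiced_peak):
    """Return True if a putative "eight" satisfies any condition.

    word_score          : word score on the 0-200 scale
    energies            : energies of the selected samples in the match
    nearest_voiced_peak : energy of the nearest voiced-peak maximum
    """
    strong = word_score > 180
    rising = any(later > earlier
                 for earlier, later in zip(energies, energies[1:]))
    loud = any(e > 0.4 * nearest_voiced_peak for e in energies)
    return strong or rising or loud
```

A low-scoring "eight" whose energy only decays, and stays below 2/5 of the neighbouring voiced peak, is rejected as the tail of the preceding word.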

B.5 Speed-Up of the Template-Matching Routine.

The most time-consuming part of the recognition algorithm
is the correlation of the selected samples of the template with the
unknown utterance. This is so because the operation must be done
so many times, and also because the correlation measure (involving
many multiplications) is intrinsically time-consuming to compute.
We tried two methods to reduce this time, and thus bring the
recognition closer to real-time.

Our first attempt was to replace the correlation with a
simpler measure of the difference between compared spectra. We tried
the mean error (the sum of the absolute differences of the
filtrates of the compared spectra) because it has no multiplications
at all. Although time was saved by this method, our recognition
results deteriorated to an unacceptable level. Undoubtedly this
was because much of the rest of our recognition algorithm was
predicated on the use of the correlation. Therefore, we returned to
the correlation in the interests of saving system development time.

Our second attempt was more successful: We sought to reduce
the number of times the correlation was executed by rejecting
template matches after a few low-scoring selected samples were
correlated. The template correlation was abandoned at sample 1 if the
correlation at sample 1 was less than ZTHR; it was abandoned at
sample 2 if the correlation for the first two samples was less than
ZTHR-ZINC; it was abandoned at sample n if the correlation for the
first n samples was less than ZTHR-nZINC. The final value of
ZTHR-nZINC was the template threshold. By establishing this initially
rigorous but subsequently relaxing criterion for continuing
correlation, we were able to save 30-40% of the recognition time.
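
The early-abandonment rule can be sketched as follows. Two choices here are assumptions: the "correlation for the first n samples" is taken as the running mean, and the threshold at sample n is taken as ZTHR - (n-1)·ZINC, which reproduces the two explicit cases given for samples 1 and 2. The function name `score_with_abandon` is hypothetical.

```python
def score_with_abandon(correlations, zthr, zinc):
    """Running-mean template score with early rejection.

    correlations : iterable yielding one correlation per template
                   sample; a generator lets later (costly)
                   correlations be skipped after a rejection
    Returns the mean correlation, or None if the match was abandoned.
    """
    total, n = 0.0, 0
    for c in correlations:
        n += 1
        total += c
        # Strict at first, then relaxed by `zinc` for each further sample.
        if total / n < zthr - (n - 1) * zinc:
            return None
    return total / n if n else None
```

Passing a generator that computes each correlation on demand means a template rejected at its first sample costs only one correlation.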
Our initial attempts at speaker adaptation during the
period of this contract were guided by the theoretical principles
of Section 2: We sought to remove from word templates the spectra
of the data-base speaker's voice and insert the voice spectra of
the new speaker. The method we employed had three basic parts:

B.6.1. Making Skeletons From a Data-Base Speaker's Word
Templates and Basis-Function Templates.

For each vowel name (I, 0, A, U), and for each
unvoiced-sound name (S, $)* (extracted from the preamble "Key Sue
Fur Shop"), a 16-vector basis function was constructed by averaging
the selected samples in each basis-function template with that name,
and then averaging the three instances of the sound. Averaging over
more instances would have produced smoother basis functions similar
to color theory's response functions. For each selected sample in
each word template, a least-squares best fit was found with a
linear combination of the basis functions. The six coefficients in
this expansion are stored as integers where the 16 byte-filtrates
used to be in the template file. (Since 1 integer = 2 bytes, 6
integers fit into 12 of the 16 bytes allocated for the filtrates of
a selected sample of speech.) The expansion coefficients form a
skeleton that is devoid of the spectra of the speaker's voice.

Our first experiment with spectral adaptation used only the
vowels I, §, A, U (taken from the preamble "He Had Hot Food") to
adapt the voiced part of the template, and simply carried the
unvoiced part through without adaptation (see Section 2.1.1.).
However, we found that the unvoiced speech also needs adaptation; since
our voice/unvoice detector is unreliable, we expanded every
template sample in terms of all 6 spectra I, 0, A, U, S, $ (taken from
the preamble "Key Sue Fur Shop"), instead of trying to segregate and
expand separately the voiced and unvoiced parts of the template.
(Ultimately, we expect greater reliability from the expansion when
we can reliably partition the expansion domain into subspaces with
distinct voiced and unvoiced basis functions.)

As shown in Fig. 3, program SKLTRN (a utility that is separate
from the main G0M43N task) operates on a master template file
(consisting of all speakers' templates) to produce the master skeleton
file. In the process, each speaker's basis-function templates are
brought to bear on his/her word templates.

B.6.2. Reconstituting Templates From Data-Base Word
Skeletons and a New Speaker's Basis Functions.

The new speaker's basis functions were obtained as
above from his/her basis-function template file. For each selected
sample in each skeleton, the six skeleton coefficients were
multiplied by the respective basis spectra of the new speaker, and the
results were added together to produce a 16-byte vector. This
vector is stored where the skeleton coefficients were in the template
file (and extending into the vacant areas where the old templates
used to be before skeleton creation). The reconstructed templates
are now endowed with the voice spectra of the new speaker. The
program that does this is OUTTMP (see Fig. 3).
B.6.3. Extracting Basis Phonemes From a Training Preamble.

For each data-base speaker and also for each new
speaker, a template file was made containing the basis phonemes I,
U, 0, A, S, $ from three utterances of "Key Sue Fur Shop". For the
vowel phonemes (I, U, 0, A), each template consisted of the two selected
samples nearest the voiced energy peak of the relevant word. For
the unvoiced sounds (S, $), each template consisted of the two
unvoiced selected samples nearest the voiced onsets of "Sue" and "Shop",
respectively. Occasionally the $ phoneme would register at least one
voiced sample because of high energy and the unreliable low-versus-
high frequency discrimination of the voice/unvoice detector. In such
cases, the $ generally creates a distinct voiced energy peak;
therefore, the $ was extracted from this peak when it occurred.*

B.7 Speaker Adaptation by Speaker Categorization and
Template-File Augmentation.

The results of recognition experiments performed on the
spectrally adapted templates were not entirely satisfactory, because
the basic theory was still developing, the expansion seemed sometimes
to degrade the template spectra, and the voice/unvoice detector was
unreliable. Also, the projected time required to process the 50
data-base speakers via spectral adaptation exceeded the time
reasonably available to complete the contract. Therefore, we adopted a
method of speaker adaptation by categorization. This method involves
immediate adaptation to a new speaker by switching in a template file
for the most similar 1 - 3 speakers in the data base; the subsequent
long-term adaptation is implemented by adding new templates from the new
speaker to correct recognition errors. The details of the method
are as follows:

* In some experiments (Section 2.1) 4 or 8 basis functions were
used instead of the 6 described here.

B.7.1. Short-Term Adaptation by Speaker Categorization.

Templates for the 21-word vocabulary are stored for each of 44
data-base speakers in separate template files. Also, templates from an
utterance of "Key Sue Fur Shop" are stored from each data-base
speaker. This constitutes the data entered prior to the new speaker's
introduction to the machine.

A new  speaker  begins  by  saying  "Key  Sue  Fur  Shop"  a single  time 
into  the  microphone,  and  the  computer  goes  through  the  following 
steps  in  about  two  minutes: 

a) The "Key Sue Fur Shop" is automatically segmented into 4 parts
and matched against the "Key Sue Fur Shop" of each data-base speaker.

b) The three best correlation scores are recorded, and the templates
from the three best-matching speakers are entered into a single
template file (via a special task COMBIN).

c) This template file is installed for subsequent word recognition
from the new speaker.
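Steps a) and b) can be illustrated with a minimal sketch. It assumes that each preamble segment is summarized by one spectral vector and that the match score is a normalized correlation averaged over the four segments; the report does not give the exact scoring rule, and the names `correlation`, `preamble_score`, `categorize`, and `combine_templates` are hypothetical.

```python
from math import sqrt

def correlation(a, b):
    """Normalized correlation between two equal-length vectors (assumed score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def preamble_score(new_segments, ref_segments):
    """Average correlation over the four preamble segments."""
    scores = [correlation(n, r) for n, r in zip(new_segments, ref_segments)]
    return sum(scores) / len(scores)

def categorize(new_segments, database, n_best=3):
    """Return the ids of the n_best best-matching data-base speakers, best first."""
    ranked = sorted(
        database,
        key=lambda spk: preamble_score(new_segments, database[spk]["preamble"]),
        reverse=True,
    )
    return ranked[:n_best]

def combine_templates(database, speaker_ids):
    """Merge the vocabulary templates of the chosen speakers into one list
    (the role the report assigns to the COMBIN task)."""
    combined = []
    for spk in speaker_ids:
        for word, template in database[spk]["templates"]:
            combined.append((spk, word, template))
    return combined

# Toy data: four data-base speakers, four preamble segments each.
database = {
    "s1": {"preamble": [[1, 0, 0]] * 4, "templates": [("one", [1, 2])]},
    "s2": {"preamble": [[0, 1, 0]] * 4, "templates": [("one", [3, 4]), ("two", [5, 6])]},
    "s3": {"preamble": [[0, 0, 1]] * 4, "templates": [("one", [7, 8])]},
    "s4": {"preamble": [[1, 1, 0]] * 4, "templates": [("one", [9, 9])]},
}
new_speaker = [[0.1, 1, 0]] * 4
best = categorize(new_speaker, database)
template_file = combine_templates(database, best)
```

The combined list plays the part of the single template file installed in step c).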

B.7.2. Long-Term Adaptation by Adding Templates From Recognition
Errors to the New Template File.

After the initial short-term adaptation, the new speaker utters the
following triplets of digits deemed difficult to recognize by the
machine: 118, 111, 311, 318, 418, 411, 711, 718, 911, 918, 831, 838,
819, 841, 848, 849, 889, 088, 188, 288, 388, 488, 788, 888, 988.
Similar utterances are spoken for the command words. These utterances
are stored automatically with sequential utterance-file names, and
then recalled for automatic recognition. An operator notes the
recognition errors, and makes templates from them. The recognition
process takes about five minutes, and the template-making (done
manually via the interactive graphics terminal) takes no more than a
minute for each word error.

If this process is repeated at various intervals during a long
session of the new speaker with the machine, it can correct for slow
variations in utterance manner, and thus is a form of continuing,
long-term speaker adaptation.

B.8 Software Overview.

Our programming efforts were devoted to transferring recognition
programs from the PDP 8 to the PDP 11/70 running under the RSX-11M
operating system. Midway through the contract period, we changed from
Version 2 to Version 3 of the RSX-11M, which is the version we are
presently using. We have developed programs to enable the system to
store utterances automatically and to interact conveniently with human
operators by means of a light pen and a CRT graphics terminal.
Progress to date is described below:

B.8.1.  Operating  System. 

By replacing Version 2 of the RSX-11M with Version 3, we obtained
greater versatility during program development, and also attained an
efficient interface with human operators using a light pen and a VT-11
graphics terminal. We also increased processing speed without
sacrificing versatility by discontinuing the use of the PDP 11/10 as
an adjunct. We were thus able to dispense with DECNET, which was a
cumbersome (though reliable) mode of coupling the 11/10 with the
11/70.

There  are  three  principal  tasks  in  the  recognition  program; 
all  three  exist  in  memory  together  and  are  handled  in  parallel. 
Within  each  task,  there  is  also  an  intricate  memory  management 
scheme  between  the  subroutines  and  the  system  common  blocks.  This 
required  a significant  amount  of  time  to  implement,  but  has  rewarded 
us  with  an  increased  speed  and  versatility  in  the  processing. 

Subroutines of each task are automatically switched in and out of
memory by system event flags and system common blocks. Human
intervention occurs only through the graphics terminal, which displays
options on a CRT and allows manipulation via typewriter commands and
via a light pen.

B.8.2. Task Structure.

There are three principal tasks in the present recognizer:

COMMON  Written in Fortran IV Plus for optimum execution times and
        for the added disk-handling features, this task handles most
        disk files related to the project. These include template
        files, word files, utterance files, scratch files, and
        various utility files used by the task. COMMON also performs
        all time-dependent algorithms used for recognition. It
        receives all its instructions from the GRAFIC task, via
        system event flags and parameters left in the system common
        blocks.


GRAFIC  This task performs all functions needed to interface the
        operator with the project. The functions include menu
        selection, graphic display of results, and control of the
        COMMON task.

PRNT    This task prints files created by the COMMON task. Its
        operation is much like that of the print spooler of the
        RSX-11M system. The PRNT task will also send reproduced
        speech to a speaker, when the operator wishes to hear stored
        taped data.

The internal structures of the COMMON and GRAFIC tasks are
summarized in the program-structure diagrams (Figs. 4, 5).
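The way GRAFIC drives COMMON (parameters left in a common block, then an event flag raised) can be mimicked in a few lines. This is only an analogy, assuming Python threads in place of RSX-11M tasks, a dictionary in place of a system common block, and `threading.Event` objects in place of system event flags; the names `common_block`, `request_flag`, `done_flag`, `common_task`, and `grafic_request` are hypothetical.

```python
import threading

common_block = {"command": None, "result": None}  # stands in for a system common block
request_flag = threading.Event()                  # stands in for a system event flag
done_flag = threading.Event()

def common_task():
    """Wait for an instruction from GRAFIC, act on it, and signal completion."""
    while True:
        request_flag.wait()
        request_flag.clear()
        command = common_block["command"]
        if command == "EXIT":
            done_flag.set()
            return
        common_block["result"] = "performed " + command
        done_flag.set()

def grafic_request(command):
    """Leave a parameter in the common block, raise the flag, await the reply."""
    common_block["command"] = command
    done_flag.clear()
    request_flag.set()
    done_flag.wait()
    return common_block["result"]

worker = threading.Thread(target=common_task)
worker.start()
print(grafic_request("TRAIN"))
print(grafic_request("RECOGNIZE"))
grafic_request("EXIT")
worker.join()
```

Each request blocks until the worker has answered, so the two flags never race; the real tasks may well have overlapped their work more aggressively.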

B.8.3. Program Structure of the COMMON Task.

The COMMON task consists of a number of program modules, connected
as indicated in Fig. 4. The function of each of the modules is
summarized below.

(main program)  Controls the program flow for the entire speech
                program. It starts the graphic handler (the GRAFIC
                task), and waits until GRAFIC returns with a menu
                selection (e.g., to train the recognizer or to use
                it to perform recognition). Then it decides which
                program section to call.

INITIT          Performs all the once-only initialization for the
                COMMON main program.

DIRECT          Manipulates word and template files, according to
                the user's specification of which words and
                templates are to be active (i.e., candidates for
                utterance matches).

EXITIT          Performs an orderly exit from COMMON.

DUMP            Dumps common blocks on the lineprinter, if the user
                so desires. DUMP is called through HELP, which will
                soon become a more general-purpose help module.

TRAIN           Controls program flow for the section of COMMON that
                makes templates from input utterances.

INPUT/DINPUT    Reads input data from microphone and disk,
                respectively. (Speech read from the disk has already
                been converted to digital form.) These input
                routines also perform some processing of the data.
                Whereas the raw data are the energy filtrates from
                16 filters evaluated at up to 250 ten-millisecond
                intervals, the INPUT routines pass these data to the
                rest of the program in four forms: The RAUDIO array
                contains all the filtrates, but for each sampling
                time the filtrates are normalized so the maximum of
                the 16 filtrates evaluated at that time is 255; the
                ILVSUM array contains the sum of the filtrates
                evaluated at each sampling time, normalized so the
                maximum over the 2.5 second window is 5000; the
                ILOSUM and IHISUM arrays are respectively the sums
                of the four lowest frequency and four highest
                frequency filtrates, evaluated at each sampling
                time.

VOICE1          Evaluates the state of voicing at each sampling
                time, depending on the relative values of ILOSUM and
                IHISUM. Thereafter, the speech is divided into
                voiced, unvoiced, and gap.

NORM2           Normalizes the ILVSUM in each voiced region to 5000,
                and performs a smoothing function such that
                artificial discontinuities are not thereby
                introduced into the speech data. NORM2 also
                eliminates short voicing-bursts such as are
                characteristic of noise.

SAMPLE          Evaluates the speech data in order to choose
                selected samples at times when the speech is
                changing most rapidly. SAMPLE removes some of the
                contingency on length of utterance, by performing an
                effective time-normalization on the speech signal.

SAVTPL          Saves templates created in TRAIN, by storing them on
                the disk.

PLOT            Plots speech data (including ILVSUM and selected
                samples) on the lineprinter, for a permanent record
                that affords easy visual access.

RECOG           Controls program flow for the section of COMMON that
                performs recognition of words in an unknown
                utterance, based on matching templates in the data
                base to parts of the utterance. Recognition is
                performed after the unknown utterance is passed
                through the input routine, and samples are selected
                (as with the templates).
RECSUB          Stores the unknown utterances, and then packs the
                selected samples into the initial buffers of RAUDIO
                and ILVSUM (ILOSUM and IHISUM are not used after the
                VOICE1 routine). The latter buffers of these arrays
                are used to store templates, which are cycled
                through as candidates for recognition in RECOGA.

RECOGA          Takes candidate templates from activated spot words,
                and moves them through the unknown-utterance
                selected samples, looking for a match. The best
                matches are found by a method of time-warping, as
                discussed in Section B.3.

RECOGB          Performs word recognitions from the templates
                recognized in RECOGA. The three best-recognized
                templates starting at each selected sample of the
                unknown speech are passed from RECOGA to RECOGB. A
                word is recognized starting at a given selected
                sample if, within a tolerance, the sequence of
                templates in that word can be recognized in the
                right order without interfering with one another. A
                word score is then computed, which is the average of
                the correlations of the constituent templates with
                the unknown speech. The best recognition, if its
                score exceeds a threshold, is accepted as a word
                recognition.

AUTOST          Performs automatic scanning through long utterances
                in 2.5 second pieces, in order to achieve automatic
                recognition.

ADAPT           Matches the new speaker's preamble ("Key Sue Fur
                Shop") against that of all data-base speakers, and
                combines into a single file the templates of the 3
                best-matching speakers. This file is then switched
                in for subsequent recognition of the new speaker's
                utterances. Sub-module ADAPTA does the
                categorization, and ADAPTB switches in the new file.
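The four data forms produced by INPUT/DINPUT, and the voicing decision made by VOICE1, can be sketched as follows. The normalization constants (255 and 5000) and the use of the four lowest and four highest channels come from the module descriptions above. The rest is assumed: that the low and high sums are taken on the raw filtrates, that results are rounded to integers, and that a frame is a gap below a fixed energy level and voiced when the low-frequency sum exceeds the high-frequency sum. The function names and the `gap_level` value are hypothetical.

```python
def input_forms(filtrates, frame_peak=255, window_peak=5000):
    """filtrates: one list of 16 non-negative filter energies per 10 ms frame,
    lowest-frequency channel first. Returns (raudio, ilvsum, ilosum, ihisum)."""
    raudio, sums, ilosum, ihisum = [], [], [], []
    for frame in filtrates:
        peak = max(frame)
        scale = frame_peak / peak if peak > 0 else 0.0
        raudio.append([round(v * scale) for v in frame])  # per-frame maximum becomes 255
        sums.append(sum(frame))
        ilosum.append(sum(frame[:4]))    # four lowest-frequency channels
        ihisum.append(sum(frame[-4:]))   # four highest-frequency channels
    top = max(sums) if sums else 0
    # Window maximum of the summed energy becomes 5000.
    ilvsum = [round(s * window_peak / top) if top > 0 else 0 for s in sums]
    return raudio, ilvsum, ilosum, ihisum

def voicing(ilvsum, ilosum, ihisum, gap_level=250):
    """Label each frame 'gap', 'voiced', or 'unvoiced' (assumed decision rule)."""
    states = []
    for total, low, high in zip(ilvsum, ilosum, ihisum):
        if total < gap_level:
            states.append("gap")
        elif low > high:
            states.append("voiced")
        else:
            states.append("unvoiced")
    return states

frames = [
    [0] * 16,                             # silence
    [5, 5, 5, 5] + [0] * 8 + [1, 1, 1, 1],  # energy concentrated at low frequencies
    [1, 1, 1, 1] + [0] * 8 + [2, 2, 2, 2],  # energy concentrated at high frequencies
]
raudio, ilvsum, ilosum, ihisum = input_forms(frames)
states = voicing(ilvsum, ilosum, ihisum)
```

A real utterance would supply up to 250 such frames, one per ten-millisecond interval of the 2.5 second window.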

The functions of the modules of the GRAFIC task, which is slaved to
the COMMON task, reflect the functions of the corresponding modules in
COMMON.
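The word-scoring rule given for RECOGB in the module list above (average the correlations of the constituent templates, then accept the best word only if its score exceeds a threshold) can be sketched as below. The ordering test is a simplification: it only checks that successive templates start at non-decreasing selected samples, within an assumed `tolerance`, since the report does not spell out the interference rule. All function names and the threshold values are hypothetical.

```python
def word_score(correlations):
    """Average of the constituent template correlations for one candidate word."""
    return sum(correlations) / len(correlations)

def in_order(start_samples, tolerance=0):
    """True if each template starts no earlier than its predecessor,
    allowing a slack of `tolerance` selected samples (assumed rule)."""
    return all(later >= earlier - tolerance
               for earlier, later in zip(start_samples, start_samples[1:]))

def best_word(candidates, threshold):
    """candidates: list of (word, [(start_sample, correlation), ...]).
    Returns (word, score) for the best acceptable word, or None."""
    best = None
    for word, matches in candidates:
        if not in_order([start for start, _ in matches]):
            continue
        score = word_score([corr for _, corr in matches])
        if best is None or score > best[1]:
            best = (word, score)
    if best is not None and best[1] > threshold:
        return best
    return None

candidates = [
    ("one", [(0, 0.9), (2, 0.8)]),
    ("nine", [(3, 0.99), (1, 0.99)]),   # templates found in the wrong order
    ("two", [(0, 0.6), (1, 0.7)]),
]
accepted = best_word(candidates, threshold=0.8)
rejected = best_word(candidates, threshold=0.9)
```

Here "nine" is discarded despite its high correlations because its templates are out of order, which is the behaviour the ordering condition is meant to capture.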

In addition, there is a task (PLYTSK) that interpolates spectra
between the selected samples of an utterance and uses the resulting
numbers to drive sound sources so the utterances can be heard via an
audio speaker. Another task, COMBIN, combines the template files
selected in speaker categorization (see Section B.7.1).
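The interpolation step attributed to PLYTSK can be illustrated with a short sketch. It assumes linear interpolation between neighbouring selected samples and one output spectrum per frame; the report says only that spectra are interpolated, so the interpolation law and the name `interpolate_spectra` are assumptions.

```python
def interpolate_spectra(sample_times, sample_spectra):
    """Fill in one spectrum per frame between the first and last selected sample.

    sample_times: strictly increasing frame indices of the selected samples.
    sample_spectra: one spectrum (list of channel values) per selected sample;
    at least one sample is assumed.
    """
    frames = []
    pairs = list(zip(sample_times, sample_spectra))
    for (t0, s0), (t1, s1) in zip(pairs, pairs[1:]):
        for t in range(t0, t1):
            w = (t - t0) / (t1 - t0)   # 0 at the earlier sample, approaching 1 at the later
            frames.append([(1 - w) * a + w * b for a, b in zip(s0, s1)])
    frames.append(list(sample_spectra[-1]))
    return frames

# Two-channel toy spectra at frames 0, 2, and 3.
frames = interpolate_spectra([0, 2, 3], [[0, 10], [4, 20], [8, 0]])
```

With 16-channel spectra the same routine would yield the per-frame values used to drive the sound sources.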

B.8.4. Auxiliary Utilities.

We found it most convenient to separate some of the subroutines from
the main COMMON program, in the interests of simplifying the graphic
menus where speed and contiguity were not essential (particularly in
the training phase of the program).

The  principal  utility  tasks  are  listed  below: 


STORIT  Stores utterances in files coded automatically by speaker.
        This program requests a speaker number via the DEC-writer
        and then uses a bell to prompt the speaker to make
        sequential 2.5 second utterances into the microphone. The
        stored utterances are used to test recognition performance,
        and also include the preamble "Key Sue Fur Shop".

SKLTRN  Expands the 16-vector spectra at each selected sample of
        each spot word template in terms of a least-squares
        approximation by a linear combination of basis functions
        characteristic of the speaker of the template. The
        coefficients are the speaker-independent skeleton of the
        template.

OUTTMP  Reconstitutes speaker-independent template-skeletons by
        multiplying the skeleton coefficients by a new speaker's
        basis functions and adding the products to produce the
        appropriate linear combination of basis functions. The
        reconstituted templates are now endowed with the voice
        spectra of the new speaker.

EDTTMP  Allows  quick  editing  of  template  files,  including 

renaming,  re-ordering,  deletion,  and  directory  listing. 
Wild  card  options  facilitate  the  editing  operation. 
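The skeleton expansion (SKLTRN) and reconstitution (OUTTMP) described in the list above amount to a least-squares fit followed by a linear combination. The sketch below solves the normal equations directly, which is one standard way to obtain a least-squares fit and is not necessarily how the original Fortran did it. It assumes the basis functions are linearly independent; the names `solve`, `skeleton`, and `reconstitute` are hypothetical, and 4-channel toy vectors stand in for the 16-channel spectra.

```python
def solve(matrix, rhs):
    """Solve a small square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution

def skeleton(spectrum, basis):
    """Least-squares coefficients of `spectrum` on one speaker's basis functions."""
    gram = [[sum(a * b for a, b in zip(bi, bj)) for bj in basis] for bi in basis]
    proj = [sum(a * b for a, b in zip(bi, spectrum)) for bi in basis]
    return solve(gram, proj)

def reconstitute(coefficients, basis):
    """Linear combination of another speaker's basis functions."""
    width = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coefficients, basis)) for i in range(width)]

old_basis = [[1, 0, 0, 0], [0, 1, 0, 0]]   # basis functions of the template's speaker
new_basis = [[1, 1, 0, 0], [0, 0, 1, 1]]   # basis functions of the new speaker
coeffs = skeleton([2, 3, 5, 0], old_basis)  # the 5 lies outside the basis and is dropped
adapted = reconstitute(coeffs, new_basis)
```

The part of a spectrum that the old speaker's basis cannot represent is lost in the skeleton, which may be related to the degradation of template spectra noted in Section B.7.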

B.9  Hardware  Overview. 

The word recognition system consists of a Digital Equipment
Corporation (DEC) PDP 11/70 computer with 128K of memory, other
DEC-supplied peripheral devices, custom-made audio filters, and
recording devices. The configuration is shown in Fig. 6.

B.9.1. DEC-Supplied Peripheral Devices.

The  standard  peripheral  devices  are  as  follows: 


    1  RP04    Disk drive (44 megabytes)
    2  RK05    Disk drive (1.2 megabytes)
    1  TU16    Magnetic tape drive (9 track, 1600 BPI maximum)
    1  LP-11   Lineprinter
    1  VT-11   Graphics display system (GT-42)
    1  AR-11   Analog real time system
    1  DR-11C  General device interface
    3  DL-11   Asynchronous serial line interface

Initially, the VT-11 was configured in its own PDP 11/10 computer.
At that time, this required DECNET (DEC's network software) to
communicate between the 11/10 and the main computer (11/70).

DECNET proved to be very cumbersome to use, so the VT-11 was
installed in the PDP 11/70, which now handles the graphics directly.
The only disadvantage of this arrangement is that the VT-11 slows the
central processor considerably when it is displaying graphics. Because
of this, the VT-11 is turned off when program running speed is
important. Recently, DEC has released a more efficient network
software package, which we plan to use to distribute the graphic task
once again to the PDP 11/10.

B.9.2. Audio Filter Assembly (custom-made).

The audio signal is first amplified and passed through a
pre-emphasis network and a band-pass filter. The pre-emphasis network
is an active RC network providing 6 dB/octave pre-emphasis between 700
and 4000 Hz, an emphasis near 300 Hz, and de-emphasis just below
700 Hz. The signal is band-limited by a bandpass filter with
24 dB/octave slopes and 6 dB cutoff frequencies of 250 Hz and 5300 Hz.
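As a rough check on the slopes quoted above, the total change in gain along a straight-line slope is the slope in dB/octave times the number of octaves spanned. The sketch below applies that arithmetic; it assumes idealized constant slopes, which real RC networks only approach away from their corner frequencies, and the function names are hypothetical.

```python
from math import log2

def octaves(f_from, f_to):
    """Number of octaves from f_from to f_to (negative when going down in frequency)."""
    return log2(f_to / f_from)

def slope_gain_db(f_from, f_to, db_per_octave):
    """Gain change along an idealized constant slope between two frequencies."""
    return db_per_octave * octaves(f_from, f_to)

# Boost accumulated by the 6 dB/octave pre-emphasis between 700 and 4000 Hz.
pre_emphasis_boost = slope_gain_db(700, 4000, 6)

# Change along the 24 dB/octave band-pass skirt one octave below the 250 Hz cutoff.
skirt_change = slope_gain_db(250, 125, 24)

print(round(pre_emphasis_boost, 1), round(skirt_change, 1))
```

Under these assumptions the pre-emphasis raises the top of its range by about 15 dB relative to 700 Hz, and the skirt falls a further 24 dB one octave below the lower cutoff.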


After pre-emphasis, the signal is sent through 16 data channels,
each having an active filter, an active full-wave rectifier (with a
60 dB dynamic range), and a low-pass smoothing filter. The active
filters have different characteristics, but the rectifiers and
low-pass filters are all the same. The latter are 2-pole RC filters
with a 3-dB cutoff frequency of 15 Hz. The active filters for the 16
channels are two-stage multiple-feedback filters with the following
characteristics:


    Filter No.    Center Frequency (Hz)    Q

         1                 260            5.00
         2                 317            5.00
         3                 387            5.00
         4                 472            5.00
         5                 576            5.00
         6                 694            6.35
         7                 812            6.35
         8                 950            6.35
         9                1111            6.35
        10                1300            6.35
        11                1600            5.00
        12                1952            5.00
        13                2381            5.00
        14                2905            5.00
        15                3545            5.00
        16                4325            5.00
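The table can be turned into approximate channel bandwidths. The sketch below assumes that Q is the ratio of center frequency to 3 dB bandwidth and uses the band-edge formula of an ideal second-order band-pass section; the actual two-stage filters may deviate from this, and the center frequency of filter 1 is taken as 260 Hz. The names `FILTERS`, `bandwidth_hz`, and `band_edges` are hypothetical.

```python
from math import sqrt

# (filter number, center frequency in Hz, Q) as read from the table above.
FILTERS = [
    (1, 260, 5.00), (2, 317, 5.00), (3, 387, 5.00), (4, 472, 5.00),
    (5, 576, 5.00), (6, 694, 6.35), (7, 812, 6.35), (8, 950, 6.35),
    (9, 1111, 6.35), (10, 1300, 6.35), (11, 1600, 5.00), (12, 1952, 5.00),
    (13, 2381, 5.00), (14, 2905, 5.00), (15, 3545, 5.00), (16, 4325, 5.00),
]

def bandwidth_hz(center_hz, q):
    """3 dB bandwidth under the assumption Q = center frequency / bandwidth."""
    return center_hz / q

def band_edges(center_hz, q):
    """Lower and upper 3 dB edges of an ideal second-order band-pass section."""
    root = sqrt(1 + 1 / (4 * q * q))
    half = 1 / (2 * q)
    return center_hz * (root - half), center_hz * (root + half)

for number, center, q in FILTERS:
    low, high = band_edges(center, q)
    print(f"{number:2d}  {center:5d} Hz  BW {bandwidth_hz(center, q):6.1f} Hz  "
          f"{low:7.1f} to {high:7.1f} Hz")
```

On this reading, neighbouring center frequencies are spaced by roughly one channel bandwidth, so adjacent channels overlap near their 3 dB points.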

Inputs to the filter bank can come from the microphone or can be
generated from a D/A converter connected to the DR-11C interface. The
output from the D/A converter is fed through a low-pass filter to
minimize quantization noise.

B.9.3. Pitch Extractor and Voice/Unvoice Detector.

A commercial McMorrow pitch extractor was purchased during the
contract period, but it was useful only for steady-state sounds
atypical of real speech. Therefore, it was not used in the present
system. Also, a hardware voice/unvoice detector (including a
zero-crossing circuit), which had functioned effectively with our
previous PDP 8-based recognizer, was not incorporated into the present
recognizer because it would have required a system overhaul for which
we could not afford time.

Fig. 1A. SAMPLE TRAINING PLOT, "ONE-TWO-THREE" (Second Section)

Fig. 1A. SAMPLE TRAINING PLOT, "ONE-TWO-THREE" (Third Section)

Fig. 1B. SAMPLE RECOGNITION PLOT, "ONE-TWO-THREE" (Section 1)

Fig. 1B. SAMPLE RECOGNITION PLOT, "ONE-TWO-THREE" (Section 3)

Fig. 2. TEMPLATE MATCHING WITH THE RECOGNITION

Fig. 3. SKELETON EXPANSION AND TEMPLATE RECONSTITUTION

"GRAFIC" TASK